Housing Market - Seattle Area
- 1 Chapter 1: Introduction
- 2 Chapter 2: Description of the Data
- 3 Chapter 3: Independent Variables EDA: Slicing the Data for an Overview
- 4 Chapter 4: Independent Variables EDA: Boxplots, Scatterplots, ANOVA, & Chi-Square
- 5 Chapter 5: Multiple Linear Regression Model
- 6 Chapter 6: Conclusion
- 7 Bibliography
1 Chapter 1: Introduction
Seattle has one of the “hottest” housing markets in the United States; as of July 2018, Seattle had “led the nation in home price gains” for 21 straight months, a streak topped only by Portland in the 1990s and driven by the city’s tech sector and a shortage of supply relative to demand. Given the Seattle housing market’s notoriety for high prices, we were interested in exploring which variables affect housing price in this market. With this goal in mind, we found a public dataset on Kaggle (“House Sales in King County, USA”) offering 21,613 observations across 21 variables. According to Kaggle, it “includes homes sold between May 2014 and May 2015.” Although it doesn’t cover macro-level variables affecting housing price (such as the local job market or Amazon’s presence), it does cover micro-level variables, such as renovations, number of bedrooms, and square feet of living space, that are common to virtually all housing markets in the United States. As a result, our analysis could lay the groundwork for future comparative analysis with other housing markets across the country.
This report is organized as follows:
- Description of the Data (explanation of the dataset and its variables)
- Geographic Coverage of Data
- Independent Variables EDA: Slicing the Data for an Overview
- Independent Variables EDA: Boxplots, Scatterplots, ANOVA, & Chi-Square
- Multiple Linear Regression Model
- Conclusion
2 Chapter 2: Description of the Data
2.1 Source Data
As mentioned previously, our dataset contains 21,613 observations across 21 variables. (See below for a readout of the dataset’s structure and variable names.) Variable descriptions are as follows and come from the following link; asterisks next to a variable name indicate usage in our analysis:
## 'data.frame': 21613 obs. of 21 variables:
## $ id : num 7129300520 6414100192 5631500400 2487200875 1954400510 ...
## $ date : Factor w/ 372 levels "20140502T000000",..: 165 221 291 221 284 11 57 252 340 306 ...
## $ price : num 221900 538000 180000 604000 510000 ...
## $ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
## $ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
## $ floors : num 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : int 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
## $ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
## $ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
## $ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
## $ lat : num 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
## $ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
- Id (unique ID for each home sold)
- Date (date of the home sale)
- Price (price of each home sold)***
- Bedrooms (number of bedrooms)***
- Bathrooms (number of bathrooms, where .5 accounts for a room with a toilet but no shower)
- Sqft_living (square footage of the apartments’ interior living space)***
- Sqft_lot (square footage of the land space)***
- Floors (number of floors)
- Waterfront (a dummy variable for whether the apartment was overlooking the waterfront or not; 0 represents no waterfront)
- View (an index from 0 to 4 of how the view of the property was)
- Condition (an index from 1 to 5 on the condition of the apartment, with the lowest number representing poor condition)***
- Grade (an index from 1 to 13, with the lowest number representing poor construction and design)***
- Sqft_above (the square footage of the interior housing space that is above ground level)
- Sqft_basement (the square footage of the interior housing space that is below ground level)
- yr_built (the year the house was initially built)***
- yr_renovated (the year of the house’s last renovation)***
- zipcode (what zipcode area the house is in)***
- Lat (latitude)***
- Long (longitude)***
- Sqft_living15 (the square footage of the interior housing living space for the nearest 15 neighbors)
- Sqft_lot15 (the square footage of the land lots of the nearest 15 neighbors)
For our exploratory data analysis, we ignored “Id” and “Date” because we treated them as having no relation to price. We also ignored “floors” because it can be considered a proxy for sqft_living. “Waterfront” and “View” were dropped because the vast majority of properties were coded as “0”. We ignored sqft_basement and sqft_above because they are components of “sqft_living” (we didn’t want redundancy in our analysis). We also ignored “sqft_living15” and “sqft_lot15” because, for our initial EDA, we were interested only in the attributes of individual houses, not those of their surrounding neighborhoods.
Following these decisions, we cleaned the data accordingly: we dropped “waterfront” and “view”; we subsetted the dataset to include only properties with more than 0 bedrooms and more than 0 bathrooms (we considered properties with none to be outliers); we kept only properties with fewer than 30 bedrooms (a larger count is most likely a recording error, given how small those houses are in terms of sqft); we dropped rows with “NA” values to simplify our analysis (“NA” values are hard to perform operations on); we converted “condition” and “grade” into factor variables because they are effectively ordered categories rather than continuous measurements; and we applied a logarithmic transformation to housing price to make for better visualization.
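The cleaning steps described above might look roughly like the following sketch in base R. It runs on a tiny made-up data frame rather than the Kaggle file; the toy rows and the names `raw`, `clean`, and `log_price` are assumptions for illustration, not the report's actual code.

```r
# Toy stand-in for the King County data; values are invented for illustration.
raw <- data.frame(
  price      = c(221900, 538000, 180000, 604000, 510000, NA),
  bedrooms   = c(3, 0, 2, 33, 4, 3),
  bathrooms  = c(1, 2.25, 1, 3, 0, 2),
  waterfront = c(0, 0, 0, 0, 0, 0),
  view       = c(0, 0, 0, 0, 0, 0),
  condition  = c(3, 3, 3, 5, 3, 4),
  grade      = c(7, 7, 6, 7, 8, 7)
)

# Drop the two mostly-zero columns.
clean <- raw[, !(names(raw) %in% c("waterfront", "view"))]

# Keep homes with at least one bedroom and bathroom and fewer than 30 bedrooms.
clean <- subset(clean, bedrooms > 0 & bathrooms > 0 & bedrooms < 30)

# Remove rows with missing values.
clean <- na.omit(clean)

# Treat the two rating scales as categorical factors.
clean$condition <- factor(clean$condition)
clean$grade     <- factor(clean$grade)

# Log-transform price for plotting.
clean$log_price <- log(clean$price)

print(clean)
```

In this toy example only two of the six rows survive: the others are removed for having zero bedrooms, 33 bedrooms, zero bathrooms, or a missing price.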
2.2 Geographic Coverage of Data
Below is a visualization of the points in the dataset by price, plotted with the leaflet library. Note that the data have been divided by unequal bins to provide a better visualization of the distribution of housing price, so please read the legend carefully. More expensive houses tend to be concentrated near the water and center of the city.
Here instead is a visualization of the observations by property lot sqft. Again, data have been divided by unequal bins to provide a better visualization of the distribution of lot size, so please read the legend carefully. Our observation follows common sense: the farther one ventures outside the city center, the more land there is.
3 Chapter 3: Independent Variables EDA: Slicing the Data for an Overview
3.1 House Price Distribution
A brief overview of the dataset yields the following observations for housing price: the minimum price is $78,000, while the maximum is $7,700,000 (quite a large range); the mean is $540,198, well above the median of $450,000 (indicating that the distribution is right-skewed, as further indicated by the histogram below); the standard deviation is $367,142; and the variance is 134,792,956,735 (quite large, indicating “that the data points are very spread out from the mean, and from one another”).
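As a minimal sketch of how such summary figures can be computed and sanity-checked, the base R snippet below uses a short invented price vector (not the real data). It also shows two relationships relied on in the text: the variance is the square of the standard deviation, and a mean above the median is a quick sign of right skew.

```r
# Invented right-skewed prices, for illustration only.
price <- c(78000, 250000, 322000, 450000, 645000, 900000, 7700000)

stats <- c(
  min    = min(price),
  max    = max(price),
  mean   = mean(price),
  median = median(price),
  sd     = sd(price),
  var    = var(price)
)
print(round(stats))

# The variance is the square of the standard deviation.
var_matches_sd <- isTRUE(all.equal(stats[["var"]], stats[["sd"]]^2))

# A mean above the median is a quick sign of right skew.
right_skewed <- stats[["mean"]] > stats[["median"]]

print(c(var_matches_sd = var_matches_sd, right_skewed = right_skewed))
```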
Just for context, the following readouts offer cross sections of Seattle’s most expensive houses; average prices for each condition level; and average prices for each grade level.
3.1.1 What do the most expensive houses look like?
From slicing the data, it looks like the most expensive houses are very well constructed, have tens of thousands of square feet of property, and have 5 or more bedrooms.
3.1.2 What is the average price for each condition level?
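From the slice below, average prices trend upward with condition overall, though not at every step (levels 2 and 4 average slightly less than levels 1 and 3, respectively); all averages are in the hundreds of thousands of dollars.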
## condition price
## 1 1 341067
## 2 2 328149
## 3 3 542089
## 4 4 521274
## 5 5 612402
3.1.3 What is the average price for each grade level?
From the slice below, we can see that price generally trends upward along with grade.
## grade price
## 1 3 262000
## 2 4 212002
## 3 5 248524
## 4 6 301920
## 5 7 402566
## 6 8 542944
## 7 9 773513
## 8 10 1071771
## 9 11 1496842
## 10 12 2201285
## 11 13 3709615
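Group averages like the two readouts above can be produced with base R's `aggregate`. The sketch below uses a small invented data frame (`homes` is a hypothetical name), so the numbers do not match the report.

```r
# Invented sample; the real report uses the cleaned King County data.
homes <- data.frame(
  condition = factor(c(1, 1, 3, 3, 3, 5, 5)),
  price     = c(300000, 380000, 500000, 540000, 580000, 600000, 640000)
)

# Mean price within each condition level.
avg_by_condition <- aggregate(price ~ condition, data = homes, FUN = mean)
print(avg_by_condition)
```

Swapping `condition` for `grade` in the formula would give the corresponding per-grade table.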
3.2 Bedrooms, sqft_lot, & sqft_living
Below we have included histograms for “bedrooms”, “sqft_living”, and “sqft_lot”. Upon inspecting the graphs, it becomes clear that most of the properties in this dataset have around 3 bedrooms, while the majority of properties are around 1000-2000 square feet (for reference, in 2015, the average US house size was around 2,600 square feet); as for sqft_lot, most of the properties have between 5,000 and 10,000 square feet of land. As with the housing price histogram shown earlier, these histograms are right-skewed.
3.2.1 What do the largest houses look like?
Going into this project, we hypothesized that larger houses would be priced higher than smaller houses. House size is determined in large part by “sqft_living”, of which “bathrooms” and “bedrooms” are a part.
It is apparent here that the largest houses are also among the most expensive – they are all priced in the millions of dollars, which are outliers when compared to the dataset as a whole.
The properties with the largest amount of land are also priced highly, but not as highly as those listed in the “sqft_living” readout. This could suggest a lower correlation between housing price and sqft_lot than that between housing price and sqft_living.
The properties with the largest number of bedrooms are also priced highly (around or above the dataset mean of $540,000), but not quite as highly as those in the “sqft_living” readout. Since “bedrooms” contributes in part – but not in whole – to sqft_living, it makes sense that its correlation with housing price is lower than that of sqft_living.
4 Chapter 4: Independent Variables EDA: Boxplots, Scatterplots, ANOVA, & Chi-Square
4.1 SMART Question: Are houses of different sizes priced differently?
Now that we’ve taken a look at slices of the data, we can delve deeper with some graphs. Below are scatterplots and boxplots of housing price vs. “sqft_living”.
4.1.1 Comparison of sqft living with price
From the scatterplot, it’s apparent that there is a relatively strong, positive correlation between housing price and living space (.70192, to be exact). That is, as living space increases, so does housing price. Note that a majority of the data points lie below 6,000 sqft, and below $2 million.
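For reference, a correlation coefficient like the one quoted above is the Pearson correlation, which `cor` computes by default. The sketch below uses a handful of illustrative values (not the full dataset) and checks the result against the definition: covariance divided by the product of the standard deviations.

```r
# Illustrative values only; not the full King County data.
sqft_living <- c(770, 1180, 1680, 1960, 2570, 5420)
price       <- c(180000, 221900, 510000, 604000, 538000, 1225000)

# Pearson correlation coefficient (the default method of cor()).
r <- cor(sqft_living, price)

# Equivalent definition: covariance scaled by both standard deviations.
r_manual <- cov(sqft_living, price) / (sd(sqft_living) * sd(price))

print(c(r = r, r_manual = r_manual))
```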
Now let’s take a look at the same data with a boxplot; this time, we have “sqft_living” categorized into 5 intervals. From this visualization as well, it’s apparent that “sqft_living” correlates positively with housing price. The last interval (10,891-13,540 sqft) seems to buck this trend, but it is worth noting that only 2 houses are part of this group, a very small sample, which could explain the discrepancy.
Next, we perform the Breusch-Pagan (BP) test to check whether price variance is homogeneous across the sqft_living groups, which determines whether we can perform an ANOVA test for differences in mean price between those groups.
H0: The variances of prices are the same across the different sqft_living levels.
H1: The variances of prices are different across the different sqft_living levels.
Since the p-value is effectively 0, which is lower than 0.05, we reject the null hypothesis: the price variances differ across sqft_living levels. Thus, the ANOVA test is not applicable.
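The “studentized Breusch-Pagan test” header in the report's readouts suggests the `bptest` function from the lmtest package, though that is an assumption. To show what the test does without extra packages, the sketch below computes the studentized (Koenker) statistic in base R: regress the squared residuals on the same predictors and multiply the resulting R-squared by the sample size. The helper `bp_studentized` and the simulated data are hypothetical.

```r
# Studentized (Koenker) Breusch-Pagan statistic computed with base R only.
# Hypothetical helper; lmtest::bptest() offers a packaged implementation.
bp_studentized <- function(formula, data) {
  fit  <- lm(formula, data = data)
  e2   <- residuals(fit)^2
  X    <- model.matrix(fit)
  aux  <- lm.fit(X, e2)                      # regress squared residuals on X
  r2   <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2                    # LM statistic = n * R^2
  df   <- ncol(X) - 1
  c(BP = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

set.seed(1)
# Simulated prices whose spread grows with the size group (heteroscedastic).
size_group <- factor(rep(c("small", "medium", "large"), each = 200),
                     levels = c("small", "medium", "large"))
spread <- c(small = 20000, medium = 60000, large = 200000)[as.character(size_group)]
price  <- 300000 + rnorm(600, sd = spread)
sim    <- data.frame(price, size_group)

res <- bp_studentized(price ~ size_group, sim)
print(res)
```

Because the simulated spread differs sharply between groups, the p-value comes out far below 0.05, which is the situation in which the report sets ANOVA aside.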
4.1.2 Comparison of sqft lot with price
Next up in our exploratory data analysis is housing price vs. “sqft_lot”. How does land area correlate with housing price? According to our scatterplot, not very highly – there is a positive correlation of only .08988. This seems to suggest that sqft_lot is more weakly related to housing price than sqft_living. Indeed, the vast majority of data points in the scatterplot seem to trend upward in price with relatively small increases in land area.
Next, let’s take a look at the same data in a boxplot. Unfortunately, the visualization isn’t very readable; let’s convert housing price through a logarithmic function to improve our y-axis scale.
The modified boxplot below (with the logarithmic scale) is much easier to interpret. We can see that housing price increases as land area increases, but only to an extent. Note that houses in the 991,000-1.3M sqft and 1.3M-1.65M sqft ranges appear to buck the trend. Once again, this can be explained by the fact that only a few houses are part of these two intervals – only 4 to be exact.
Next, we perform the BP test to check homoscedasticity across the sqft_lot groups and to assess whether we can perform an ANOVA test for differences in mean price between lot-size groups.
BP test:
H0: The variances of prices are the same across the different sqft_lot levels.
H1: The variances of prices are different across the different sqft_lot levels.
ANOVA test:
H0: There are no differences between the mean prices of the different sqft.lot levels.
H1: The mean prices of the different sqft.lot levels are different.
##
## studentized Breusch-Pagan test
##
## data: kc_house_data$price ~ sqft.lot
## BP = 0.2, df = 4, p-value = 1
## Df Sum Sq Mean Sq F value Pr(>F)
## sqft.lot 4 3896930121642 974232530411 7.24 0.0000081 ***
## Residuals 21591 2906956970577216 134637440164
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = kc_house_data$price ~ sqft.lot)
##
## $sqft.lot
## diff lwr upr p adj
## 330K-660K-520-330K 130987 -15181 277155 0.104
## 660K-991K-520-330K 619510 265542 973477 0.000
## 991K-1.3M-520-330K -10511 -588471 567449 1.000
## 1.3M-1.65M-520-330K 160322 -840687 1161331 0.992
## 660K-991K-330K-660K 488523 105684 871361 0.005
## 991K-1.3M-330K-660K -141498 -737577 454580 0.967
## 1.3M-1.65M-330K-660K 29335 -982244 1040914 1.000
## 991K-1.3M-660K-991K -630021 -1307691 47650 0.083
## 1.3M-1.65M-660K-991K -459187 -1520893 602518 0.763
## 1.3M-1.65M-991K-1.3M 170833 -985006 1326672 0.994
Note that the p-value of the BP test is 0.996, which is greater than 0.05, so we fail to reject the null hypothesis of equal variances across groups. We can therefore use the ANOVA test to analyze the differences in mean prices. Since the ANOVA p-value is 0.000008033, which is lower than 0.05, we reject the null hypothesis: the mean prices of the different sqft.lot levels are not all equal. From the Tukey test, the lowest p-values belong to the pairs involving the 660K-991K level: 660K-991K vs. 520-330K (p < 0.001), 660K-991K vs. 330K-660K (p = 0.005), and 660K-991K vs. 991K-1.3M (p = 0.083, which falls short of significance at the 0.05 level). The p-values for the other pairs are very high. In other words, the mean price of the 660K-991K level (the middle level) is significantly higher than that of the two smaller levels. Thus, lots of 660K-991K sqft command the highest average prices in this dataset.
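The ANOVA and Tukey readouts above have the shape produced by base R's `aov` and `TukeyHSD`. The following sketch reproduces that workflow on simulated data; the group labels and prices are invented, with group "C" deliberately given a higher mean.

```r
set.seed(42)
# Simulated prices for three lot-size groups; group "C" has a higher mean.
lot_group <- factor(rep(c("A", "B", "C"), each = 50))
price <- c(rnorm(50, mean = 500000, sd = 50000),
           rnorm(50, mean = 510000, sd = 50000),
           rnorm(50, mean = 650000, sd = 50000))
sim <- data.frame(price, lot_group)

fit <- aov(price ~ lot_group, data = sim)
print(summary(fit))        # overall F test of equal group means

tukey <- TukeyHSD(fit)     # pairwise differences with adjusted p-values
print(tukey$lot_group)
```

Each row of the Tukey table is one pair of groups: `diff` is the difference in mean price, `lwr` and `upr` bound its 95% family-wise confidence interval, and `p adj` is the adjusted p-value for that pair.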
4.1.3 Comparison of number of bedrooms with price
Here, we have a logarithmic box plot of housing price vs. “bedrooms”. There appears to be a clear trend: as the number of bedrooms increases, housing price increases as well. The 9-11 interval bucks the trend slightly, but again, this can be explained by the fact that only 10 houses are part of this interval, compared with 21586 total for the others.
Next, we perform the BP test to check homoscedasticity across the bedroom groups and to assess whether we can perform an ANOVA test for differences in mean price between them.
H0: Price variance is the same across bedroom groups.
H1: Price variance is different across bedroom groups.
Since the p-value is effectively 0, which is lower than 0.05, we reject the null hypothesis: the price variances differ across bedroom groups. Thus, the ANOVA test is not applicable.
Since the ANOVA test is not applicable, we use the chi-square test to check for an association between the number of bedrooms and house price, and we do the same for bathrooms. Prices and the numbers of bedrooms and bathrooms are divided into groups with the cut function to make them categorical, so that we can perform the chi-square test on them.
H0: The price and number of bedrooms are independent.
H1: The price and number of bedrooms are not independent.
##
## Pearson's Chi-squared test
##
## data: bed_p
## X-squared = 2287, df = 21, p-value <0.0000000000000002
##
## Pearson's Chi-squared test
##
## data: bath_p
## X-squared = 4534, df = 21, p-value <0.0000000000000002
Since both p-values are lower than 0.05, we reject the null hypothesis in each case. Thus, neither the number of bedrooms nor the number of bathrooms is independent of house price; each is associated with it. Generally, price increases as the number of bedrooms and bathrooms increases.
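The binning-then-testing procedure described above can be sketched in base R as follows. The data are simulated, and the break points and bin labels are arbitrary choices for illustration, not the ones used in the report.

```r
set.seed(7)
# Simulated homes in which price rises with the number of bedrooms.
bedrooms <- sample(1:6, 500, replace = TRUE)
price    <- 150000 + 90000 * bedrooms + rnorm(500, sd = 60000)

# Bin both variables so they can be cross-tabulated.
bed_bin   <- cut(bedrooms, breaks = c(0, 2, 4, 6),
                 labels = c("1-2", "3-4", "5-6"))
price_bin <- cut(price, breaks = quantile(price, probs = c(0, 1/3, 2/3, 1)),
                 include.lowest = TRUE, labels = c("low", "mid", "high"))

# Contingency table of bedroom bin by price bin, then the independence test.
bed_p <- table(bed_bin, price_bin)
res   <- chisq.test(bed_p)
print(bed_p)
print(res)
```

The degrees of freedom equal (rows - 1) x (columns - 1), so this 3 x 3 table gives df = 4. Note that the test only establishes association; the direction of the relationship has to be read from the table or a plot.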
4.2 SMART Question: Are houses of different quality priced differently?
4.2.1 Comparison of condition with price
Here, we compare “condition” with housing price. Once again, “condition” represents an index from 1 to 5, with the lowest number representing poor condition. Once we take a look at the boxplots below (the second one is log-transformed for clearer visualization), it becomes clear that house condition correlates positively with housing price.
Then we do the BP test to check homoscedasticity across the different condition groups.
BP test:
H0: Price variance is the same across different conditions.
H1: Price variance is different across different conditions.
Since the p-value is 0.292, which is greater than 0.05, we fail to reject the null hypothesis. We therefore treat the price variance as equal across condition groups, and the ANOVA test is applicable.
Next we do the ANOVA test to compare the mean prices of houses in different conditions.
ANOVA test:
H0: There are no differences between the mean prices of different conditions.
H1: The mean prices of the different conditions are not equal.
## Df Sum Sq Mean Sq F value
## kc_house_data$condition 4 19739857713567 4934964428392 36.9
## Residuals 21591 2891114042985296 133903665554
## Pr(>F)
## kc_house_data$condition <0.0000000000000002 ***
## Residuals
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = kc_house_data$price ~ kc_house_data$condition)
##
## $`kc_house_data$condition`
## diff lwr upr p adj
## 2-1 -12918 -213478 187642 1.000
## 3-1 201022 15459 386585 0.026
## 4-1 180207 -5637 366051 0.062
## 5-1 271335 84389 458280 0.001
## 3-2 213940 136914 290965 0.000
## 4-2 193125 115424 270825 0.000
## 5-2 284253 203953 364552 0.000
## 4-3 -20815 -36519 -5111 0.003
## 5-3 70313 44676 95950 0.000
## 5-4 91128 63529 118727 0.000
Since the p-value is 0.0000000000000000000000000000008494, which is lower than 0.05, we reject the null hypothesis: the mean prices of the different conditions are not all equal. From the Tukey test, the comparison between conditions 1 and 2 has a very large p-value (1.000), and the comparison between conditions 4 and 1 falls just short of significance (0.062); all other p-values are below 0.05. Broadly, price increases as condition improves, although condition 4 averages slightly less than condition 3.
4.2.2 Comparison of grade with price
Here we have a boxplot comparing “grade” with housing price. Once again, “grade” represents an index from 1 to 13, with the lowest number representing poor construction and design. The trend is clear: construction and design grade correlate positively with housing price.
Then we do the BP test to see if the ANOVA test is applicable.
H0: The variances of prices are the same across different grades.
H1: The variances of prices are not equal across different grades.
Since the p-value is effectively 0, which is lower than 0.05, we reject the null hypothesis. The price variances of the different grades are not equal. Thus, the ANOVA test is not applicable.
Now we want to see whether condition and grade are somehow related; for this, we must perform a chi-square test.
H0: Condition and grade of houses are independent from each other.
H1: Condition and grade of house are not independent from each other.
##
## Pearson's Chi-squared test
##
## data: cond.tbl
## X-squared = 1457, df = 40, p-value <0.0000000000000002
Since the p-value is effectively 0, which is lower than 0.05, we reject the null hypothesis. Thus, condition and grade are not independent; they are associated. Generally, the grade of construction is better when house conditions are better.
4.3 SMART Question: Are older houses priced differently?
4.3.1 Comparison of year they were built with price
Here, we have a comparison of “yr_built” with housing price. Once we take a look at the logarithmic boxplot, we see no obvious trends. Housing price trends downward from 1900-1969, and then picks back up from 1970-2015. What might explain this? Well, “yr_built” does not take “yr_renovated” into account. For instance, two equivalent houses built in the same year could have different house prices, depending on if one has been renovated while the other hasn’t.
Let’s construct the same boxplots, but this time indexed by “yr_renovated”.
Unfortunately, the vast majority of properties in this dataset have never been renovated (20682 to be exact). This means that only 914 properties have been renovated. This makes the resulting boxplots somewhat uninformative – the larger population boxplot (not renovated) largely mirrors the patterns of the previous graph, and the smaller population graph (renovated) is based on a population too small to run meaningful analysis on. We have included the graphs here to showcase our thought process, but we are well aware of their limitations.
To gauge the significance of the differences we observe in these graphs, we attempt an ANOVA test for renovated houses grouped by yr_built.
We first perform the BP test to check homoscedasticity across the construction periods. It yields a high p-value, which allows us to conduct the ANOVA for differences in price among older and newer renovated houses.
H0: There are no differences between the mean prices of the different yr_built renovated houses.
H1: The mean prices of the different yr_built renovated houses are different.
## Df Sum Sq Mean Sq F value Pr(>F)
## yr_built 4 3524693302886 881173325722 2.4 0.048 *
## Residuals 909 333684139527418 367089262406
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = price ~ yr_built, data = yr_R)
##
## $yr_built
## diff lwr upr p adj
## 1924-1946-1900-1923 -27818 -172093 116458 0.985
## 1947-1969-1900-1923 54244 -85395 193882 0.826
## 1970-1992-1900-1923 184015 -26330 394360 0.119
## 1993-2015-1900-1923 280784 -466793 1028361 0.843
## 1947-1969-1924-1946 82061 -57719 221841 0.495
## 1970-1992-1924-1946 211832 1394 422271 0.048
## 1993-2015-1924-1946 308601 -439002 1056205 0.792
## 1970-1992-1947-1969 129771 -77516 337059 0.428
## 1993-2015-1947-1969 226540 -520182 973263 0.922
## 1993-2015-1970-1992 96769 -666344 859881 0.997
Since the p-value is 0.04853, which is less than 0.05, we reject the null hypothesis. Thus, the mean prices of the renovated houses differ across yr_built groups. From the Tukey test, we can see that only the mean prices of renovated houses built in 1970-1992 and in 1924-1946 differ significantly from each other (p = 0.048). Note, however, that this p-value is much higher than the others obtained in this analysis: at a more conservative significance level, such as 0.01, we would find no significant differences among the construction periods of the renovated houses. That is reasonable given that most houses were renovated around the same period and are thus more likely to be priced similarly.
Although we could not perform an ANOVA test for the overall sample of houses by yr_built because of the BP test results in that case, the boxplots suggest that there may likewise be no significant differences in price between older and newer houses in this sample, regardless of renovation status.
4.3.2 Comparison of year renovated with price
Here, we’ve graphed housing price by yr_renovated itself. This graph also showcases only 914 properties – the ones that were renovated. Generally speaking, as “yr_renovated” approaches the present day, price increases. The exception is between the 1924-1946 and 1947-1969 intervals; note however, that only 9 properties occupy the first interval.
5 Chapter 5: Multiple Linear Regression Model
5.1 SMART Question: What factors influence the house price the most?
5.1.1 LSRL Model building
Below is our regression model, along with a comprehensive correlation plot.
5.1.1.1 First, take a look at all numeric variables and their correlation.
yr_built and sqft_lot seem unrelated to price, as their correlation coefficients with price are almost 0; accordingly, we do not choose them as independent variables to predict house price.
##
## Call:
## lm(formula = price ~ . - sqft_lot - yr_built, data = h2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1706400 -142615 -21697 101727 4133425
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80549.20 7851.17 10.26 < 0.0000000000000002 ***
## bedrooms -64528.79 2446.40 -26.38 < 0.0000000000000002 ***
## bathrooms 5499.45 3852.17 1.43 0.15341
## sqft_living 340.68 4.99 68.33 < 0.0000000000000002 ***
## floors 14875.02 4299.43 3.46 0.00054 ***
## sqft_above -36.54 5.03 -7.26 0.0000000000004 ***
## sqft_basement NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 257000 on 21590 degrees of freedom
## Multiple R-squared: 0.51, Adjusted R-squared: 0.509
## F-statistic: 4.49e+03 on 5 and 21590 DF, p-value: <0.0000000000000002
The coefficient of “sqft_basement” is NA, which indicates that it is perfectly collinear with other variables (in the data, sqft_living equals sqft_above plus sqft_basement), so we dropped it. The p-value of “bathrooms” is too large (meaning it’s insignificant), so we dropped it as well.
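The NA coefficient can be reproduced with a small simulation. In the sketch below (invented data, with variable names borrowed from the dataset), one predictor is an exact sum of two others, so `lm` cannot estimate all three slopes and reports the last one as NA.

```r
set.seed(3)
n <- 200
sqft_above    <- round(runif(n, 600, 3000))
sqft_basement <- round(runif(n, 0, 1000))
sqft_living   <- sqft_above + sqft_basement   # exact linear dependence
price <- 100000 + 250 * sqft_living + rnorm(n, sd = 50000)

fit <- lm(price ~ sqft_living + sqft_above + sqft_basement)
print(coef(fit))   # the last, redundant coefficient is reported as NA
```

Which term ends up as NA depends on the order of the predictors in the formula, so the NA flags redundancy in the set of predictors rather than a problem with that one variable.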
##
## Call:
## lm(formula = price ~ . - sqft_lot - yr_built - sqft_basement -
## bathrooms, data = h2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1711175 -142619 -21684 101736 4134675
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81162.16 7839.61 10.35 < 0.0000000000000002 ***
## bedrooms -63928.53 2410.05 -26.53 < 0.0000000000000002 ***
## sqft_living 343.98 4.42 77.89 < 0.0000000000000002 ***
## floors 17336.10 3938.78 4.40 0.000010807578329 ***
## sqft_above -37.41 5.00 -7.49 0.000000000000074 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 257000 on 21591 degrees of freedom
## Multiple R-squared: 0.51, Adjusted R-squared: 0.509
## F-statistic: 5.61e+03 on 4 and 21591 DF, p-value: <0.0000000000000002
## bedrooms sqft_living floors sqft_above
## 1.55 5.37 1.48 5.59
The model looks better now; we also checked the VIF value of each variable, and none is very large (all are below 6), indicating no severe multicollinearity. We then added the two factor variables (“grade” and “condition”) into the dataset to see their effects.
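For readers unfamiliar with the variance inflation factor: the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The report does not say which package produced its VIF readout, so the sketch below computes the quantity directly in base R with a hypothetical helper, `vif_manual`, on simulated data.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on all the other predictors.
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
    1 / (1 - r2)
  })
}

set.seed(11)
n <- 300
sqft_above  <- runif(n, 600, 3000)
sqft_living <- sqft_above + runif(n, 0, 800)   # strongly tied to sqft_above
floors      <- sample(c(1, 1.5, 2), n, replace = TRUE)
X <- data.frame(sqft_living, sqft_above, floors)

vifs <- vif_manual(X)
print(round(vifs, 2))
```

In this simulation the two square-footage variables inflate each other's VIF, while `floors`, generated independently, stays close to 1.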
## price bedrooms sqft_living sqft_above
## Min. : 78000 Min. : 1.00 Min. : 370 Min. : 370
## 1st Qu.: 322000 1st Qu.: 3.00 1st Qu.: 1430 1st Qu.:1190
## Median : 450000 Median : 3.00 Median : 1910 Median :1560
## Mean : 540198 Mean : 3.37 Mean : 2080 Mean :1789
## 3rd Qu.: 645000 3rd Qu.: 4.00 3rd Qu.: 2550 3rd Qu.:2210
## Max. :7700000 Max. :11.00 Max. :13540 Max. :9410
##
## floors grade condition
## Min. :1.00 7 :8973 1: 29
## 1st Qu.:1.00 8 :6065 2: 170
## Median :1.50 9 :2615 3:14020
## Mean :1.49 6 :2038 4: 5677
## 3rd Qu.:2.00 10 :1134 5: 1700
## Max. :3.50 11 : 399
## (Other): 372
##
## Call:
## lm(formula = price ~ ., data = h3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1571904 -122108 -22410 88448 4645573
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119945.02 235012.20 0.51 0.60979
## bedrooms -27981.05 2255.58 -12.41 < 0.0000000000000002 ***
## sqft_living 229.90 4.37 52.66 < 0.0000000000000002 ***
## sqft_above -89.75 4.65 -19.29 < 0.0000000000000002 ***
## floors 25574.69 3798.02 6.73 0.000000000017 ***
## grade4 52746.97 235247.78 0.22 0.82259
## grade5 47680.72 231471.19 0.21 0.83680
## grade6 77263.78 231054.85 0.33 0.73808
## grade7 110396.50 231036.91 0.48 0.63278
## grade8 183631.54 231071.92 0.79 0.42680
## grade9 327826.56 231145.29 1.42 0.15613
## grade10 530666.16 231258.19 2.29 0.02176 *
## grade11 829031.75 231551.46 3.58 0.00034 ***
## grade12 1357863.56 232717.34 5.83 0.000000005462 ***
## grade13 2552196.47 240535.18 10.61 < 0.0000000000000002 ***
## condition2 -60791.74 46520.27 -1.31 0.19130
## condition3 -62199.90 43243.71 -1.44 0.15035
## condition4 -5587.43 43281.73 -0.13 0.89728
## condition5 71583.51 43532.93 1.64 0.10012
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 231000 on 21577 degrees of freedom
## Multiple R-squared: 0.605, Adjusted R-squared: 0.604
## F-statistic: 1.83e+03 on 18 and 21577 DF, p-value: <0.0000000000000002
## bedrooms sqft_living sqft_above floors grade4 grade5
## 1.68 6.51 6.01 1.70 27.99 240.44
## grade6 grade7 grade8 grade9 grade10 grade11
## 1847.93 5250.37 4367.68 2302.97 1077.66 393.79
## grade12 grade13 condition2 condition3 condition4 condition5
## 90.02 14.10 6.85 172.49 147.02 55.66
The coefficients for all four “condition” levels (2 through 5, each measured against level 1) are insignificant, so we can drop the “condition” variable. For the “grade” variable, the higher grade levels (10 and above) have significant effects on price. By contrast, the lower grade levels do not affect price significantly.
##
## Call:
## lm(formula = price ~ . - condition, data = h3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1594879 -122654 -26548 89616 4612348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 203437.68 234053.01 0.87 0.3848
## bedrooms -25531.89 2283.29 -11.18 < 0.0000000000000002 ***
## sqft_living 241.42 4.40 54.90 < 0.0000000000000002 ***
## sqft_above -101.49 4.69 -21.65 < 0.0000000000000002 ***
## floors 11331.97 3765.10 3.01 0.0026 **
## grade4 -58531.64 238316.99 -0.25 0.8060
## grade5 -47771.75 234518.50 -0.20 0.8386
## grade6 -24710.22 234099.87 -0.11 0.9159
## grade7 2597.97 234076.33 0.01 0.9911
## grade8 71562.72 234108.48 0.31 0.7598
## grade9 212286.01 234180.94 0.91 0.3647
## grade10 412631.72 234294.46 1.76 0.0782 .
## grade11 707325.79 234587.29 3.02 0.0026 **
## grade12 1233712.59 235767.03 5.23 0.00000017 ***
## grade13 2416245.65 243679.96 9.92 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 234000 on 21581 degrees of freedom
## Multiple R-squared: 0.594, Adjusted R-squared: 0.594
## F-statistic: 2.26e+03 on 14 and 21581 DF, p-value: <0.0000000000000002
## bedrooms sqft_living sqft_above floors grade4 grade5
## 1.68 6.43 5.94 1.63 27.97 240.31
## grade6 grade7 grade8 grade9 grade10 grade11
## 1846.94 5247.31 4365.02 2301.52 1076.98 393.53
## grade12 grade13
## 89.96 14.09
Next, we add interaction terms to the model, since we want to see whether relationships between the predictors affect the price prediction. We first put all pairwise interactions among the numeric predictors into the model to see what would happen.
##
## Call:
## lm(formula = price ~ . + bedrooms:sqft_living + bedrooms:floors +
## bedrooms:sqft_above + sqft_living:floors + sqft_living:sqft_above +
## floors:sqft_above, data = h4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3695612 -119737 -25450 86085 3346184
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 245214.63848 229404.26957 1.07
## bedrooms -51836.57174 7294.93923 -7.11
## sqft_living 86.13585 16.53152 5.21
## sqft_above -19.70424 21.95986 -0.90
## floors 36232.45799 14886.82743 2.43
## grade4 -45196.54696 233098.37846 -0.19
## grade5 -15410.52348 229421.70528 -0.07
## grade6 21127.44809 229036.61201 0.09
## grade7 74426.60451 229048.12877 0.32
## grade8 160884.19715 229093.63765 0.70
## grade9 310355.78334 229162.33811 1.35
## grade10 492220.91881 229263.64399 2.15
## grade11 721454.84294 229542.28824 3.14
## grade12 1105961.32670 230837.39838 4.79
## grade13 1837445.26776 239796.69406 7.66
## bedrooms:sqft_living -16.02514 3.80857 -4.21
## bedrooms:floors 25728.79001 5273.85824 4.88
## bedrooms:sqft_above 20.28226 5.16049 3.93
## sqft_living:floors 101.59363 8.63259 11.77
## sqft_living:sqft_above 0.03473 0.00195 17.79
## sqft_above:floors -177.53444 9.78398 -18.15
## Pr(>|t|)
## (Intercept) 0.2851
## bedrooms 0.000000000001233 ***
## sqft_living 0.000000190166631 ***
## sqft_above 0.3696
## floors 0.0149 *
## grade4 0.8463
## grade5 0.9464
## grade6 0.9265
## grade7 0.7452
## grade8 0.4825
## grade9 0.1757
## grade10 0.0318 *
## grade11 0.0017 **
## grade12 0.000001669853101 ***
## grade13 0.000000000000019 ***
## bedrooms:sqft_living 0.000025908368174 ***
## bedrooms:floors 0.000001076292883 ***
## bedrooms:sqft_above 0.000085104651817 ***
## sqft_living:floors < 0.0000000000000002 ***
## sqft_living:sqft_above < 0.0000000000000002 ***
## sqft_above:floors < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 229000 on 21575 degrees of freedom
## Multiple R-squared: 0.612, Adjusted R-squared: 0.611
## F-statistic: 1.7e+03 on 20 and 21575 DF, p-value: <0.0000000000000002
## bedrooms sqft_living sqft_above
## 17.9 95.0 136.2
## floors grade4 grade5
## 26.6 28.0 240.4
## grade6 grade7 grade8
## 1848.1 5252.3 4369.7
## grade9 grade10 grade11
## 2304.0 1078.0 393.9
## grade12 grade13 bedrooms:sqft_living
## 90.2 14.3 144.4
## bedrooms:floors bedrooms:sqft_above sqft_living:floors
## 70.0 194.0 150.2
## sqft_living:sqft_above sqft_above:floors
## 32.1 173.0
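A fit of this shape, together with per-term VIFs, can be sketched as follows. This is a minimal illustration, not the code behind the output above: the data frame is a synthetic stand-in that only reuses the name h4 and the column names from the Call, and the VIFs are computed by hand in base R (the diagonal of the inverse correlation matrix of the design-matrix columns), since the report does not say which VIF function was used.

```r
# Synthetic stand-in for h4 (hypothetical values, NOT the Kaggle data)
set.seed(1)
n <- 500
h4 <- data.frame(
  bedrooms    = sample(1:6, n, replace = TRUE),
  sqft_living = round(runif(n, 600, 5000)),
  floors      = sample(c(1, 1.5, 2, 3), n, replace = TRUE),
  grade       = factor(sample(5:10, n, replace = TRUE))
)
h4$sqft_above <- round(h4$sqft_living * runif(n, 0.5, 1))
h4$price <- 50000 + 200 * h4$sqft_living + 20000 * as.integer(h4$grade) +
  rnorm(n, sd = 50000)

# (a + b + c + d)^2 expands to the four main effects plus all six
# pairwise interactions, the same terms spelled out in the Call above
full <- lm(price ~ grade + (bedrooms + sqft_living + sqft_above + floors)^2,
           data = h4)

# Per-column VIFs: diagonal of the inverse correlation matrix of the
# design-matrix columns (intercept column dropped)
X <- model.matrix(full)[, -1]
vifs <- diag(solve(cor(X)))
print(round(vifs, 1))
```

Interaction columns are products of their parent variables, so they are strongly correlated with those parents by construction; that is why the VIFs of the numeric main effects rise sharply once the interactions enter the model.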
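We then pared down the interaction terms: we dropped interactions that were insignificant, along with interactions that caused main-effect variables to become insignificant (for example, sqft_above has p = 0.37 in the model above). The model below keeps two interactions, bedrooms:sqft_above and sqft_living:sqft_above. Every non-grade term in it is significant, and the adjusted R-squared is 0.605, compared with 0.611 for the model with all six interactions. This is the model we retain.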
##
## Call:
## lm(formula = price ~ . + bedrooms:sqft_above + sqft_living:sqft_above,
## data = h4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3552516 -119684 -27362 86763 3536811
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 301536.38692 230960.70303 1.31
## bedrooms -37942.22923 5140.56726 -7.38
## sqft_living 174.31529 5.84404 29.83
## sqft_above -232.12312 9.01099 -25.76
## floors 14271.05585 3722.97985 3.83
## grade4 -35043.35004 235079.61029 -0.15
## grade5 9116.80210 231363.17493 0.04
## grade6 50285.38874 230969.99192 0.22
## grade7 107201.66962 230973.41942 0.46
## grade8 195978.96124 231016.35221 0.85
## grade9 342536.69129 231090.06081 1.48
## grade10 523989.90325 231190.83681 2.27
## grade11 751707.87220 231468.92597 3.25
## grade12 1141807.53068 232773.60628 4.91
## grade13 1933822.97431 241686.51914 8.00
## bedrooms:sqft_above 12.57344 2.48837 5.05
## sqft_living:sqft_above 0.02832 0.00169 16.75
## Pr(>|t|)
## (Intercept) 0.19171
## bedrooms 0.0000000000001629 ***
## sqft_living < 0.0000000000000002 ***
## sqft_above < 0.0000000000000002 ***
## floors 0.00013 ***
## grade4 0.88150
## grade5 0.96857
## grade6 0.82765
## grade7 0.64256
## grade8 0.39626
## grade9 0.13828
## grade10 0.02343 *
## grade11 0.00117 **
## grade12 0.0000009399814052 ***
## grade13 0.0000000000000013 ***
## bedrooms:sqft_above 0.0000004387237711 ***
## sqft_living:sqft_above < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 231000 on 21579 degrees of freedom
## Multiple R-squared: 0.605, Adjusted R-squared: 0.605
## F-statistic: 2.07e+03 on 16 and 21579 DF, p-value: <0.0000000000000002
## bedrooms sqft_living sqft_above
## 8.75 11.67 22.55
## floors grade4 grade5
## 1.64 27.97 240.39
## grade6 grade7 grade8
## 1847.89 5251.23 4368.71
## grade9 grade10 grade11
## 2303.51 1077.80 393.79
## grade12 grade13 bedrooms:sqft_above
## 90.13 14.24 44.35
## sqft_living:sqft_above
## 23.68
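The step from the six-interaction model to the two-interaction model can be sketched with update(). As before, h4 here is a synthetic stand-in (hypothetical values), so the numbers it prints will not match the output above. The nested F-test via anova() is an extra check added for illustration; the report itself does not mention using it.

```r
# Synthetic stand-in for h4 (hypothetical values, NOT the Kaggle data)
set.seed(2)
n <- 500
h4 <- data.frame(
  bedrooms    = sample(1:6, n, replace = TRUE),
  sqft_living = round(runif(n, 600, 5000)),
  floors      = sample(c(1, 1.5, 2, 3), n, replace = TRUE),
  grade       = factor(sample(5:10, n, replace = TRUE))
)
h4$sqft_above <- round(h4$sqft_living * runif(n, 0.5, 1))
h4$price <- 50000 + 200 * h4$sqft_living + 20000 * as.integer(h4$grade) +
  rnorm(n, sd = 50000)

# Model with all six pairwise interactions, written as in the earlier Call
full <- lm(price ~ . + bedrooms:sqft_living + bedrooms:floors +
             bedrooms:sqft_above + sqft_living:floors +
             sqft_living:sqft_above + floors:sqft_above, data = h4)

# Drop four interactions, keeping bedrooms:sqft_above and
# sqft_living:sqft_above as in the reduced Call
reduced <- update(full, . ~ . - bedrooms:sqft_living - bedrooms:floors -
                    sqft_living:floors - floors:sqft_above)

# Compare adjusted R-squared, then run a nested-model F-test
adj_r2 <- c(full    = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)
print(round(adj_r2, 3))
print(anova(reduced, full))
```

A small drop in adjusted R-squared in exchange for lower VIFs and interpretable main effects is the trade-off the reduced model makes.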
5.1.2 Final results and approved model for prediction of price
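Price = 301536 + bedrooms × (-37942 + 12.57 × sqft_above) + sqft_living × (174.3 + 0.02832 × sqft_above) - 232.1 × sqft_above + 14271 × floors + grade effect, where the grade effect is the coefficient for the house's grade level in the summary above (for example, 107202 for grade 7) and zero for the baseline grade.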
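As a worked example, the sketch below plugs a hypothetical house into the estimates printed in the reduced-model summary above (the lm output with the two interactions). The helper predict_price() and the example house are illustrative assumptions, not part of the original analysis, and treating any grade without a printed coefficient as the baseline is likewise an assumption.

```r
# Estimates copied from the reduced-model summary above (not a new fit)
b <- c(intercept = 301536.38692, bedrooms = -37942.22923,
       sqft_living = 174.31529, sqft_above = -232.12312,
       floors = 14271.05585, bedrooms_x_above = 12.57344,
       living_x_above = 0.02832)
grade_coef <- c("4" = -35043.35004, "5" = 9116.80210, "6" = 50285.38874,
                "7" = 107201.66962, "8" = 195978.96124, "9" = 342536.69129,
                "10" = 523989.90325, "11" = 751707.87220,
                "12" = 1141807.53068, "13" = 1933822.97431)

# predict_price() is a hypothetical helper, not a function from the report
predict_price <- function(bedrooms, sqft_living, sqft_above, floors, grade) {
  g <- as.character(grade)
  # Assumption: a grade with no printed coefficient is the baseline (effect 0)
  grade_term <- if (g %in% names(grade_coef)) grade_coef[[g]] else 0
  unname(b["intercept"] +
           b["bedrooms"] * bedrooms +
           b["sqft_living"] * sqft_living +
           b["sqft_above"] * sqft_above +
           b["floors"] * floors +
           b["bedrooms_x_above"] * bedrooms * sqft_above +
           b["living_x_above"] * sqft_living * sqft_above +
           grade_term)
}

# Hypothetical house: 3 bedrooms, 2,000 sqft living, 1,500 sqft above, 1 floor, grade 7
example <- predict_price(3, 2000, 1500, 1, 7)
print(round(example))  # about 451169
```

Because of the interactions, the effect of one more bedroom is not constant: at 1,500 sqft above ground it is -37942 + 12.57 × 1500, or roughly -19,000 dollars.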
Problem: The price histogram above is strongly right-skewed; that is, the dataset contains many outliers with very high prices. We did not exclude these outliers when building the model because we considered them important. As a result, our final model is somewhat skewed as well: for low-priced houses it may predict prices that are too high, and for high-priced houses it may predict prices that are too low.
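The over- and under-prediction pattern described here can be illustrated on simulated data. The sketch below is not the report's data or model: it assumes a single predictor (sqft) and log-normal noise to produce a right-skewed price, fits a straight line, and then averages the residuals within each quartile of actual price.

```r
# Simulated, right-skewed prices (hypothetical values, NOT the Kaggle data)
set.seed(3)
n <- 2000
sqft  <- runif(n, 500, 5000)
# Multiplicative log-normal noise gives price a long right tail
price <- 150 * sqft * exp(rnorm(n, sd = 0.5))

fit <- lm(price ~ sqft)

# Moment-based sample skewness; a positive value means right-skewed
skew <- mean((price - mean(price))^3) / sd(price)^3

# Mean residual (actual minus predicted) within each quartile of actual price
quartile <- cut(price, quantile(price, 0:4 / 4), include.lowest = TRUE,
                labels = c("Q1", "Q2", "Q3", "Q4"))
mean_resid <- tapply(resid(fit), quartile, mean)

print(round(skew, 2))
print(round(mean_resid))
```

A negative mean residual in Q1 means the cheapest houses are over-predicted on average, and a positive one in Q4 means the most expensive houses are under-predicted. Repeating the same quartile summary on the fitted housing model would show how strong the effect is in the real data.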
6 Chapter 6: Conclusion
This analysis provided many insights into the dynamics of the housing market in the Seattle area. After conducting EDA on most of the variables in the dataset and considering the possible interactions among them, we were able to formulate sharper questions about what actually drives price differences among houses in the area. In particular, we realized how few houses had been renovated (or at least recorded as such). To account for the small differences in price between older and newer houses, we suppose that older houses may have some historical value and are thus priced comparably to newer buildings.
As we proceeded with hypothesis testing, we were able to confirm the expected answers to most of our questions: price was influenced by square footage, quality, and age (age only to an extent). Notably, our ANOVA and chi-square analyses also showed the following: 1) different property sizes yield different housing prices; 2) different conditions yield different housing prices; 3) different yr_built values yield different housing prices within the “renovated” subset (although this finding offers limited insight); 4) price is dependent on the number of bedrooms and bathrooms; 5) condition and grade are not independent. Finally, our regression model identified additional predictors of housing price (such as sqft_above and floors), making it possible to apply the model to future datasets and adjust it for better predictive power.
We believe that more extensive information on renovation would have helped us assess the actual correlation between yr_built and housing price. Clearer definitions for some of the variables would also have made this analysis, and the data behind it, stronger and more replicable. Moving forward, it would be interesting to compare our model’s price predictions with actual prices from subsequent years. It would also be interesting to replicate this study in different cities across the United States: would the same variables have the same effects on housing price elsewhere, or do our results apply exclusively to Seattle? An additional study might also consider what variables affect the rate at which housing prices change in Seattle and other American cities. Although outside the scope of our study, this is an important question, since the rate of housing price change could affect people’s perception of cities’ long-term livability, and thus the cities’ demographics themselves. Indeed, nothing less than the future viability of our cities is at stake here; this is why we chose to study this topic, and why we hope others will choose to as well.
7 Bibliography
Perry, M. J. (2016, June 5). New US homes today are 1,000 square feet larger than in 1973 and living space per person has nearly doubled. Retrieved from https://www.aei.org/carpe-diem/new-us-homes-today-are-1000-square-feet-larger-than-in-1973-and-living-space-per-person-has-nearly-doubled/
Roberts, D. (n.d.). Variance and Standard Deviation. Retrieved from https://mathbitsnotebook.com/Algebra1/StatisticsData/STSD.html
Rosenberg, M. (2018, July 31). Seattle-area home prices this spring rose at fastest rate since 2006 bubble. The Seattle Times. Retrieved from https://www.seattletimes.com/business/real-estate/seattle-area-home-prices-this-spring-rose-at-fastest-rate-since-2006-bubble/